Reliability test requirements
Marzieh Eftekhari

Reliability test requirements can follow from any analysis for which the first estimate of failure probability, failure mode or effect needs to be justified. Testing can generate evidence with some level of confidence. In software-based systems, the failure probability is a mix of software-based and hardware-based failures. Testing reliability requirements is problematic for several reasons. A single test is in most cases insufficient to generate enough statistical data. Multiple tests or long-duration tests are usually very expensive. Some tests are simply impractical, and environmental conditions can be hard to predict over a system's life cycle.

Reliability engineering is used to design a realistic and affordable test program that provides empirical evidence that the system meets its reliability requirements. Statistical confidence levels are used to address some of these concerns. A parameter is expressed along with a corresponding confidence level: for example, an MTBF of 1000 hours at a 90% confidence level. From this specification, the reliability engineer can, for example, design a test with explicit criteria for the number of hours and the number of failures until the requirement is met or failed. Different sorts of tests are possible.

The combination of required reliability level and required confidence level greatly affects the development cost and the risk to both the customer and the producer. Care is needed to select the best combination of requirements, for example with respect to cost-effectiveness. Reliability testing may be performed at various levels, such as component, subsystem and system.
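As a rough illustration of the MTBF example above, the sketch below estimates how many cumulative test hours a fixed-duration test needs in order to demonstrate a given MTBF at a given confidence level while allowing a fixed number of failures. It assumes a constant failure rate (exponentially distributed times between failures), which the text does not state, and the function names are hypothetical.

```python
import math


def poisson_cdf(k, mu):
    """Probability of observing at most k failures when mu are expected."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def required_test_hours(mtbf, confidence, allowed_failures=0):
    """Cumulative test hours needed to demonstrate `mtbf` at `confidence`.

    Assumption: constant failure rate. The test passes if no more than
    `allowed_failures` failures occur. The test time is chosen so that a
    system whose true MTBF is only `mtbf` would pass with probability
    1 - confidence.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    target = 1.0 - confidence
    lo, hi = 0.0, float(mtbf)
    # Grow the upper bound until the pass probability drops below the target.
    while poisson_cdf(allowed_failures, hi / mtbf) > target:
        hi *= 2.0
    # Bisection: the pass probability decreases as the test time increases.
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(allowed_failures, mid / mtbf) > target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for failures in (0, 1):
        hours = required_test_hours(1000, 0.90, failures)
        print(f"allowed failures={failures}: about {hours:.0f} test hours")
```

Under these assumptions, demonstrating an MTBF of 1000 hours at 90% confidence takes about 2303 failure-free test hours, or about 3890 hours if one failure is allowed.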
Also, many factors must be addressed during testing and operation, such as extreme temperature and humidity, shock, vibration, or other environmental factors (like loss of signal, cooling or power; or other catastrophes such as fire, floods, excessive heat, physical or security violations, or other forms of damage or degradation). For systems that must last many years, accelerated life tests may be needed.

[Figure: A reliability sequential test plan]

The purpose of reliability testing is to discover potential problems with the design as early as possible and, ultimately, to provide confidence that the system meets its reliability requirements. Reliability testing may be performed at several levels, and there are different types of testing. Complex systems may be tested at component, circuit board, unit, assembly, subsystem and system levels. (The test level nomenclature varies among applications.) For example, performing environmental stress screening tests at lower levels, such as piece parts or small assemblies, catches problems before they cause failures at higher levels. Testing proceeds during each level of integration through full-up system testing, developmental testing, and operational testing, thereby reducing program risk.

However, testing does not mitigate unreliability risk. With each test, both a statistical type 1 error and a type 2 error can be made; their probabilities depend on sample size, test time, assumptions and the needed discrimination ratio. There is a risk of incorrectly accepting a bad design (type 2 error) and a risk of incorrectly rejecting a good design (type 1 error).

It is not always feasible to test all system requirements. Some systems are prohibitively expensive to test; some failure modes may take years to observe; some complex interactions result in a huge number of possible test cases; and some tests require the use of limited test ranges or other resources.
In such cases, different approaches to testing can be used, such as (highly) accelerated life testing, design of experiments, and simulations.

The desired level of statistical confidence also plays a role in reliability testing. Statistical confidence is increased by increasing either the test time or the number of items tested. Reliability test plans are designed to achieve the specified reliability at the specified confidence level with the minimum number of test units and the minimum test time. Different test plans result in different levels of risk to the producer and the consumer. The desired reliability, statistical confidence, and risk levels for each side influence the ultimate test plan. The customer and developer should agree in advance on how reliability requirements will be tested.

A key aspect of reliability testing is to define "failure". Although this may seem obvious, there are many situations where it is not clear whether a failure is really the fault of the system. Variations in test conditions, operator differences, weather and unexpected situations can lead to disagreements between the customer and the system developer. One strategy to address this issue is to use a scoring conference process. A scoring conference includes representatives from the customer, the developer, the test organization, the reliability organization, and sometimes independent observers. The scoring conference process is defined in the statement of work. Each test case is considered by the group and "scored" as a success or failure. This scoring is the official result used by the reliability engineer.

As part of the requirements phase, the reliability engineer develops a test strategy with the customer. The test strategy makes trade-offs between the needs of the reliability organization, which wants as much data as possible, and constraints such as cost, schedule and available resources. Test plans and procedures are developed for each reliability test, and results are documented.
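The trade-off between producer and consumer risk described above can be made concrete with a small calculation. The following sketch evaluates a fixed-duration test plan that accepts the system if no more than a given number of failures occur. It assumes a constant failure rate, and the plan parameters (test hours, design MTBF, minimum acceptable MTBF) are illustrative values, not taken from any standard.

```python
import math


def poisson_cdf(k, mu):
    """Probability of observing at most k failures when mu are expected."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def plan_risks(test_hours, max_failures, design_mtbf, minimum_mtbf):
    """Return (producer_risk, consumer_risk) for an accept/reject test plan.

    Assumption: constant failure rate, so the failure count is Poisson.
    producer_risk: chance of rejecting a system that achieves design_mtbf
                   (type 1 error).
    consumer_risk: chance of accepting a system that only achieves
                   minimum_mtbf (type 2 error).
    """
    producer_risk = 1.0 - poisson_cdf(max_failures, test_hours / design_mtbf)
    consumer_risk = poisson_cdf(max_failures, test_hours / minimum_mtbf)
    return producer_risk, consumer_risk


if __name__ == "__main__":
    # Hypothetical plan: 3890 test hours, accept on at most 1 failure,
    # discrimination ratio design_mtbf / minimum_mtbf = 2.
    alpha, beta = plan_risks(3890, 1, design_mtbf=2000, minimum_mtbf=1000)
    print(f"producer risk = {alpha:.3f}, consumer risk = {beta:.3f}")
```

With these illustrative numbers the consumer's risk is about 10%, while the producer's risk is close to 58%. Lowering both at once requires a longer test with a higher allowed failure count, or a larger discrimination ratio, which is why both sides should agree on the plan in advance.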
Accelerated testing

The purpose of accelerated life testing (ALT) is to induce field failures in the laboratory at a much faster rate by providing a harsher, but nonetheless representative, environment. In such a test, the product is expected to fail in the lab just as it would have failed in the field, but in much less time. The main objective of an accelerated test is either of the following:
· To discover failure modes
· To predict the normal field life from the high-stress lab life

An accelerated testing program can be broken down into the following steps:
· Define the objective and scope of the test
· Collect the required information about the product
· Identify the stress(es)
· Determine the level of stress(es)
· Conduct the accelerated test and analyze the collected data

Common models for determining a life-stress relationship are:
· Arrhenius model
· Eyring model
· Inverse power law model
· Temperature-humidity model
· Temperature-nonthermal model

Software reliability

Software reliability is a special aspect of reliability engineering. System reliability, by definition, includes all parts of the system, including hardware, software, supporting infrastructure (including critical external interfaces), operators and procedures. Traditionally, reliability engineering focuses on critical hardware parts of the system. Since the widespread use of digital integrated circuit technology, software has become an increasingly critical part of most electronics and, hence, nearly all present-day systems.

There are significant differences, however, in how software and hardware behave. Most hardware unreliability is the result of a component or material failure that results in the system not performing its intended function. Repairing or replacing the hardware component restores the system to its original operating state. However, software does not fail in the same sense that hardware fails. Instead, software unreliability is the result of unanticipated results of software operations.
Even relatively small software programs can have astronomically large numbers of combinations of inputs and states, which are infeasible to test exhaustively. Restoring software to its original state only works until the same combination of inputs and states produces the same unintended result. Software reliability engineering must take this into account.

Despite this difference in the source of failure between software and hardware, several software reliability models based on statistics have been proposed to quantify what we experience with software: the longer software is run, the higher the probability that it will eventually be used in an untested manner and exhibit a latent defect that results in a failure (Shooman 1987), (Musa 2005), (Denney 2005).

As with hardware, software reliability depends on good requirements, design and implementation. Software reliability engineering relies heavily on a disciplined software engineering process to anticipate and design against unintended consequences. There is more overlap between software quality engineering and software reliability engineering than between hardware quality and reliability. A good software development plan is a key aspect of the software reliability program. The software development plan describes the design and coding standards, peer reviews, unit tests, configuration management, software metrics and software models to be used during software development.

A common reliability metric is the number of software faults, usually expressed as faults per thousand lines of code. This metric, along with software execution time, is key to most software reliability models and estimates. The theory is that software reliability increases as the number of faults (or the fault density) goes down.
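The fault density metric mentioned above is simple arithmetic. A minimal sketch follows; the function name and the build figures are made up for illustration.

```python
def fault_density_per_kloc(fault_count, lines_of_code):
    """Faults per thousand lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return fault_count / (lines_of_code / 1000.0)


if __name__ == "__main__":
    # Hypothetical builds: (name, known open faults, lines of code).
    builds = [("build-1", 45, 30_000), ("build-2", 30, 32_000)]
    for name, faults, loc in builds:
        density = fault_density_per_kloc(faults, loc)
        print(f"{name}: {density:.2f} faults per KLOC")
```

Tracking this figure from build to build gives the trend that the reliability engineer uses as an indicator.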
Establishing a direct connection between fault density and mean-time-between-failure is difficult, however, because of the way software faults are distributed in the code, their severity, and the probability of the combination of inputs necessary to encounter the fault. Nevertheless, fault density serves as a useful indicator for the reliability engineer. Other software metrics, such as complexity, are also used. This metric remains controversial, since changes in software development and verification practices can have a dramatic impact on overall defect rates.

Testing is even more important for software than for hardware. Even the best software development process results in some software faults that are nearly undetectable until tested. As with hardware, software is tested at several levels, starting with individual units and continuing through integration and full-up system testing. Unlike hardware, it is inadvisable to skip levels of software testing. During all phases of testing, software faults are discovered, corrected, and re-tested. Reliability estimates are updated based on the fault density and other metrics. At the system level, mean-time-between-failure data can be collected and used to estimate reliability. Unlike hardware, performing exactly the same test on exactly the same software configuration does not provide increased statistical confidence. Instead, software reliability uses different metrics, such as code coverage.

Eventually, the software is integrated with the hardware in the top-level system, and software reliability is subsumed by system reliability. The Software Engineering Institute's Capability Maturity Model is a common means of assessing the overall software development process for reliability and quality purposes.

Reliability engineering vs safety engineering

Reliability engineering differs from safety engineering with respect to the kind of hazards that are considered. Reliability engineering is ultimately concerned only with cost.
It relates to all reliability hazards that could turn into incidents with a particular level of loss of revenue for the company or the customer. These can be costs due to loss of production caused by system unavailability, unexpectedly high or low demand for spares, repair costs, man hours, (multiple) re-designs, interruptions to normal production (e.g. due to long repair times or unexpected demand for non-stocked spares) and many other indirect costs.

Safety engineering, on the other hand, is more specific and regulated. It relates only to very specific system safety hazards that could potentially lead to severe accidents. The related functional reliability requirements are sometimes extremely high. It deals with unwanted dangerous events (for life and the environment) in the same sense as reliability engineering, but normally does not look directly at cost and is not concerned with repair actions after failures or accidents (at system level). Another difference is the level of impact of failures on society and the degree of government control. Safety engineering is often strictly controlled by governments (e.g. in the nuclear, aerospace, defense, rail and oil industries).

Furthermore, safety engineering and reliability engineering may even have contradicting requirements. This relates to system-level architecture choices. For example, in train signal control systems it is common practice to use a fail-safe system design concept. In this concept the so-called "wrong side failures" need to be fully controlled to an extremely low failure rate. These failures are related to possible severe effects, like frontal collisions (two GREEN lights). Systems are designed in such a way that the vast majority of failures simply result in a temporary or total loss of signals or open relay contacts and generate RED lights for all trains. This is the safe state. All trains are stopped immediately.
This fail-safe logic might unfortunately lower the reliability of the system. The reason is the higher risk of false tripping, as any full, temporary or intermittent failure is quickly latched in a shut-down (safe) state. Different solutions are available for this issue. See the section on fault tolerance below.

Fault tolerance

Reliability can be increased here by using 2oo2 (2 out of 2) redundancy at part or system level, but this in turn lowers the safety level (more possibilities for wrong side failures and undetected dangerous failures). Fault-tolerant voting systems (e.g. 2oo3 voting logic) can increase both reliability and safety at the system level. In this case the so-called "operational" or "mission" reliability as well as the safety of a system can be increased. This is also common practice in aerospace systems that need continued availability and do not have a fail-safe mode (e.g. flight computers and the related electrical, mechanical and/or hydraulic steering functions always need to be working; there are no safe fixed positions for the rudder or other steering parts while the aircraft is flying).

Basic reliability and mission (operational) reliability

The above example of a 2oo3 fault-tolerant system increases both mission reliability and safety. However, the "basic" reliability of the system will in this case still be lower than that of a non-redundant (1oo1) or 2oo2 system. Basic reliability refers to all failures, including those that might not result in system failure but do result in maintenance repair actions, logistic cost, use of spares, etc. For example, the replacement or repair of one channel in a 2oo3 voting system that is still operating with one failed channel (and which in this state has actually become a 1oo2 system) contributes to basic unreliability but not to mission unreliability. Likewise, the failure of the taillight of an aircraft is not considered a mission loss failure, but does contribute to basic unreliability.
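The difference between mission reliability and basic reliability for a 2oo3 arrangement can be illustrated numerically. The sketch below assumes identical, independent channels, a perfect voter and no common cause failures, none of which is guaranteed in practice. Here "2oo3" is read as "the mission succeeds while at least 2 of the 3 channels are working"; the function names and the channel reliability of 0.9 are illustrative.

```python
from math import comb


def mission_reliability(k, n, channel_reliability):
    """Probability that at least k of n independent channels are working."""
    p = channel_reliability
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def basic_reliability(n, channel_reliability):
    """Probability that no channel fails, so no repair action is needed."""
    return channel_reliability**n


if __name__ == "__main__":
    p = 0.9  # hypothetical reliability of one channel over the mission
    print(f"1oo1 mission and basic reliability: {p:.3f}")
    print(f"2oo3 mission reliability: {mission_reliability(2, 3, p):.3f}")
    print(f"2oo3 basic reliability:   {basic_reliability(3, p):.3f}")
```

With a channel reliability of 0.9, the 2oo3 system reaches a mission reliability of 0.972 compared with 0.9 for a single channel, while its basic reliability falls to 0.729 because three channels offer more opportunities for a failure that needs repair.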
Detectability and common cause failures

When using fault-tolerant systems (redundant architectures) or systems equipped with protection functions, the detectability of failures and the avoidance of common cause failures become paramount for safe functioning and/or mission reliability.

Reliability operational assessment

After a system is produced, reliability engineering monitors, assesses and corrects deficiencies. Monitoring includes electronic and visual surveillance of critical parameters identified during the fault tree analysis design stage. The data are constantly analyzed using statistical techniques, such as Weibull analysis and linear regression, to ensure the system reliability meets requirements. Reliability data and estimates are also key inputs for system logistics. Data collection is highly dependent on the nature of the system. Most large organizations have quality control groups that collect failure data on vehicles, equipment and machinery. Consumer product failures are often tracked by the number of returns. For systems in dormant storage or on standby, it is necessary to establish a formal surveillance program to inspect and test random samples. Any changes to the system, such as field upgrades or recall repairs, require additional reliability testing to ensure the reliability of the modification.

Since it is not possible to anticipate all the failure modes of a given system, especially ones with a human element, failures will occur. The reliability program also includes a systematic root cause analysis that identifies the causal relationships involved in the failure so that effective corrective actions can be implemented. When possible, system failures and corrective actions are reported to the reliability engineering organization.

One of the most common methods applied in a reliability operational assessment is the Failure Reporting, Analysis and Corrective Action System (FRACAS).
This systematic approach develops a reliability, safety and logistics assessment based on failure/incident reporting, management, analysis and corrective/preventive actions. Organizations today are adopting this method and use commercial systems, such as web-based FRACAS applications, to create a failure/incident data repository from which statistics can be derived to give an accurate picture of reliability, safety and quality performance.

It is extremely important to have one common FRACAS system for all end items. It should also be practical to capture test results in this system. Failure to adopt a single integrated system that is easy to use (with easy data entry for field engineers and repair shop engineers) and easy to maintain is likely to result in the failure of the FRACAS program.

Some of the common outputs from a FRACAS system include: field MTBF, MTTR, spares consumption, reliability growth, and the distribution of failures/incidents by type, location, part number, serial number, symptom, etc.

The use of past data to predict the reliability of new comparable systems or items can be misleading, as reliability is a function of the context of use and can be affected by small changes in design or manufacturing.

Reliability organizations

Systems of any significant complexity are developed by organizations of people, such as a commercial company or a government agency. The reliability engineering organization must be consistent with the company's organizational structure. For small, non-critical systems, reliability engineering may be informal. As complexity grows, the need arises for a formal reliability function. Because reliability is important to the customer, the customer may even specify certain aspects of the reliability organization.

There are several common types of reliability organizations. The project manager or chief engineer may employ one or more reliability engineers directly.
In larger organizations, there is usually a product assurance or specialty engineering organization, which may include reliability, maintainability, quality, safety, human factors, logistics, etc. In such cases, the reliability engineer reports to the product assurance manager or specialty engineering manager.

In some cases, a company may wish to establish an independent reliability organization. This is desirable to ensure that work on system reliability, which is often expensive and time consuming, is not unduly slighted due to budget and schedule pressures. In such cases, the reliability engineer works for the project day-to-day, but is actually employed and paid by a separate organization within the company.

Because reliability engineering is critical to early system design, it has become common for reliability engineers, however the organization is structured, to work as part of an integrated product team.